AN EXPERIMENT IN GRADING PROBLEMS IN
ALGEBRA.

By Edward L. Thorndike. 

The Difficulty of Certain Problems in Algebra as Judged 
by the Consensus of Two Hundred Teachers of 
Mathematics. 

Two hundred teachers of mathematics, chiefly members of 
the New York Section of the Association of Teachers of Mathe- 
matics of the Middle States and Maryland, ranked the twenty- 
five problems printed below for difficulty, " difficulty " being 
defined as in the instructions appended. The variations in the 
individual opinions were very great, being as shown in Table I. 
It is an interesting exercise to examine this table, and imagine, 
as well as one can, the points of view from which these varying 
estimates were each plausible — to divine, for example, why Prob- 
lem T was rated all the way from easiest to hardest of the twenty- 
five. How much of the variation was due to tenable points of 
view and how much was due to errors of judgment cannot, of 
course, be told until the problem in question has been tested 
with respect to the percentage of pupils able to solve it in the 
time allowed. 

In spite of the great variation, the consensus of the two hun- 
dred teachers gives a fairly clear order of difficulty for the 
problems, as shown at the bottom of Table I. D is easiest; K
is next easiest; A is next; and so on (T and P being closely
alike and I, X and Y being closely alike). These position-values
are, however, of comparatively little use. The significance of
any one of them depends upon consideration of the entire series,
and is, as it stands, only a measure of relative position, not of the
amount of "difficulty" or of anything else.

List of Problems. 

To be ranked in order of difficulty. See sheet of instructions. 

A. If x + 3a = 5a, what does x equal?

B. The circumference of a circle = 2πr. π = 3 1/7. r = the
length of the radius of the circle in question. If the diam-
eter of a bicycle wheel is 28 inches, how many inches is
the circumference?

C. If — =4^, what does x equal? 

5 10 

D. If a = 4 and b = 2, what does a + b equal ? 

x 

- — I 

E. If 2-1 = 0, what does x equal? 

2 

a 

F. A cube containing eight cubic inches was plated with copper. 

The difference in the weights of the cube before and after 
the plating was 0.139 lbs. 1 cubic inch of copper weighs 
0.315 lbs. Form an equation from which the approxi- 
mate thickness of the copper plating could be calculated. 
State whether the approximate estimated thickness by 
your equation would be less or more than the exact 
thickness. 

G. If a = 6 and b = 3, what does √a √2b equal?

H. If = , what does x equal ? 

a x x 

I. A man has a hours to spend riding with a friend. How far 
can they ride together, going out at the rate of b miles an 
hour, and just covering the return trip at the rate of c 
miles an hour?

J. If (a + b)/(b + c) = (c + d)/(d + a), prove that a = c or that a + b + c + d = 0.

K. If a = 4 and b = 0, what does a + b equal?

L. If 3x + 4 = 2x + 8, what does x equal?

-£ _L_ /T v ft gr™ 

M. If 5 = 1, what does x equal? 

x — a j x + a or — xr 

N. There are two thermometers or scales to measure tempera-
ture. The Fahrenheit scale (F.) is the one we commonly
use. The other is called the Centigrade scale (C.). A
temperature of 32 degrees on the F. scale = 0 degrees on
the C. scale. 33.8 degrees on the F. scale = 1 degree on
the C. scale. 35.6 degrees on F. = 2 degrees on C. 50
degrees on F. = 10 degrees on C. 14 degrees on F.
= -10 degrees on C.

(a) What on the C. scale = 70 on the F. scale?

(b) What on the C. scale = 4 degrees below zero on the
F. scale?

(c) What on the F. scale = 20 degrees on the C. scale?

O. If a = 3 and b = 2, what does a² - ab equal?

P. If x - 2a + b = 2x + 2b - 4a, what does x equal?

Q. If — 4 — + — — = , , 37 — , what does x equal ? 
x + 2 x + 3 x- + 5 x + 6 

R. Let l stand for the safe load that can be hoisted by a hemp
rope. Let c stand for the circumference of a rope. If
l = 100c² for any hemp rope, how many pounds are a
safe load for a hemp rope 2^4 inches in circumference?

S. If a = 6 and b = 1, what does 20b - ab² equal?

T. Find the average midnight temperature for the week in which
the daily midnight temperatures were 15, 3, 0, -7, -9, 6,
and 17 degrees.

U. If x/(a + b) = a - b, what does x equal?

V. How much water must be added to a pint of "alcohol, 95% 
pure," to make a solution of "alcohol, 40% pure"?

W. Given that 2x - 3 is less than x + 5 and that 11 + 2x is less
than 3x + 5, to find the limits (i. e., the values) between
which x lies.

(It is understood that the pupil has not had any special training in 
inequalities or limits. This problem is, so to speak, an original exercise.) 

X. At what time between 6 and 6:30 o'clock are the hands of 
a watch at right angles to each other? 

Y. If x = (a + b)/2, what does ((x - a)/(x - b))³ - (x - 2a + b)/(x + a - 2b) equal?

Instructions for the Experiment. 

Cut up the sheet of problems so as to have twenty-five strips, 
each strip with one problem. Examine the problems with a view 
to ranking them in order of difficulty for pupils of age from 14 
years 0 months to 15 years 11 months, who have had twenty
weeks' work in algebra, five periods a week, or its equivalent.
Mark the easiest one, 1; mark the next easiest one, 2; mark the
next easiest, 3; continue, the hardest problem being marked 25.
If two or more of the problems seem to be of equal difficulty, 
choose one of them at random for the easiest of them, and so 
on. That is, suppose that you have already picked out as to 
difficulty your Nos. 1, 2, 3, 4, and 5, and that you then find three 
problems that seem equally difficult. These are to fill ranks 6, 7 
and 8. Shuffle them, take one for 6, another for 7, and another 
for 8, at random. 

In ranking the problems let "more difficult" mean, for any 
example, " likely to be solved correctly in thirty minutes by a 
smaller percentage of the pupils." That is, assume that thirty 
minutes are allowed for each problem and try to prophesy the 
ranking of the problems by the number out of ten thousand pupils 
(of the sort stated above) who would fail with each problem. 
If you prophesy that 2,000 pupils will fail to get Problem A and 
1,900 will fail on Problem B, then call A more difficult than B. 

The amounts of difficulty ascribed to these problems by the 
consensus of the two hundred teachers can be inferred from 
the percentage of them judging each example to be harder than 
each other example. If, that is, eighty per cent. of them judged
K to be harder than D, while 90 per cent. of them judged A to
be harder than D, and 99 or 100 per cent. of them judged O to be
harder than D, we should (letting d, k, a, and o equal, respec-
tively, the amounts of "difficulty" possessed by D, K, A and O



128 THE MATHEMATICS TEACHER. 

in the minds of the two hundred teachers) be confident that

o > a, that a > k, and that k > d. If, further, eighty per cent.
of them judged A to be harder than K, we may be confident that
a - k is approximately equal to k - d, since the two differences are
equally often judged correctly by the two hundred judges, there
being a minority above zero.

Counting up the 200 decisions in the case of each problem's
comparison with every other, we have a table of which Table
II. shows the first three lines in part as a sample. This table
reads: 83 per cent. of the teachers judged that k > d; 93 per
cent. of them judged that a > d; 94½ per cent. of them judged
that l > d; 100 per cent. of them judged that o > d; 82 per cent.
of them judged that a > k; 88½ per cent. of them judged that
l > k; 97 per cent. of them judged that o > k; etc.

TABLE II.

The Frequency of Judgments that "K is more Difficult than D,"
that "A is more Difficult than D," that "L is more Difficult
than D," etc., etc. In percentages.

       K     A     L     O     S     T     P     U    G  B  C  Q  etc.
D     83    93    94½   100   100   etc.
K           82    88½    97    97    97   etc.
A                 67     64½   71    83½   99   etc.
L                        59½   64    72½   99½   93   etc.
O
S
etc.

It is possible to use knowledge of the relation that holds good
between (1) the percentage of judges judging a certain differ-
ence correctly and (2) the amount of the difference, so as to
infer the latter from the former. I will not, however, enter into
either the principles or the technique of doing so for the case in 
hand, since a much simpler method of defining the amounts of 
difficulty of a series of these problems is available. 

Let a, b, c, etc., equal respectively the " difficulty" of A, " diffi- 
culty" of B, etc., in the opinions of the two hundred teachers, 
as heretofore. Then the table from which Table II. is an ex- 
cerpt would show the following:

83 % of the judges judged that k > d
82 % of the judges judged that a > k
83½% of the judges judged that t > a
81½% of the judges judged that h > t
82½% of the judges judged that e > h
81 % of the judges judged that i > e

Hence, approximately, k - d = a - k; a - k = t - a; t - a = h - t; h - t
= e - h; and e - h = i - e.

Call the amount of difference in "difficulty" necessary in
order that 82¼ per cent. of the judges should rate the harder
of two problems as harder, dif.

Then, approximately,

the difficulty of D in the minds of the 200 judges is d
the difficulty of K in the minds of the 200 judges is d + 1 dif.
the difficulty of A in the minds of the 200 judges is d + 2 dif.
the difficulty of T in the minds of the 200 judges is d + 3 dif.
the difficulty of H in the minds of the 200 judges is d + 4 dif.
the difficulty of E in the minds of the 200 judges is d + 5 dif.
the difficulty of I in the minds of the 200 judges is d + 6 dif.

The difficulty of D, the difficulty of K, the difficulty of A, and 
so on, can now be stated as we state the weights of certain 
objects, or the amounts of wealth possessed by certain men, if 
we can estimate the amount of difficulty of D itself — that is, 
the difference between the difficulty of D and zero, or "just not 
any," difficulty. 

A problem in algebra that approached zero difficulty might 
be variously defined, but perhaps the most useful definition 
would be: "A problem that is truly algebraic, not simply a prob- 
lem for quantitative thinking in general, and certainly not simply 
a problem for thought in general; and that has some difficulty, 
but less than any other such problem." For example, the prob-
lem, "Call one dime d, call two dimes 2d, call three dimes 3d.
How many cents are there in 4d?" is truly algebraic and would,
by many competent judges, be rated as easier than D. The
problem, "Call one dime d, call two dimes 2d, call three dimes
3d. How many cents are there in 3d?" is also truly algebraic
and is perhaps still easier. Somewhere amongst the algebraic 
problems that can be devised that are easier than D will be found
one (call it a) nearly at the point of "just not any" difficulty.
By "nearly" is of course meant that from the true zero of 
difficulty to the difficulty of a is a small fraction of the difference 
between the true zero and, say, the difficulty of E or I. 

The difference between "zero difficulty" as an algebra prob-
lem and the difficulty of D has not been determined. It is almost
surely, in the opinion of the two hundred, less than 2 dif. and
probably little, if at all, over 1 dif. Call it 1 dif. That is,
assume that zero, or "just not any" difficulty as a problem in
algebra is represented by a problem as much easier than D as D
is easier than K.

Then

the difficulty of D = 1 dif.
the difficulty of K = 2 dif.
the difficulty of A = 3 dif.
the difficulty of T = 4 dif.
the difficulty of H = 5 dif.
the difficulty of E = 6 dif.
the difficulty of I = 7 dif.

And T is twice as difficult as K in the sense of being twice as 
far from zero on a scale for " difficulty " as defined by the con- 
sensus ; E is one and a half times as difficult as T, twice as diffi- 
cult as A ; etc., etc. 
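
The construction of this scale can be retraced numerically. The following is a minimal sketch in Python, not part of the original paper; the variable names are illustrative. It assumes, as the text does, that the six nearly equal percentages justify treating the six steps as equal, and that zero difficulty lies 1 dif. below D.

```python
from statistics import mean

# Chain of problems, easiest first, as ordered by the consensus.
chain = ["D", "K", "A", "T", "H", "E", "I"]

# Percentage of the 200 judges rating each problem harder than the one
# before it in the chain (k > d, a > k, t > a, h > t, e > h, i > e).
step_percentages = [83.0, 82.0, 83.5, 81.5, 82.5, 81.0]

# The steps are treated as equal because these percentages are nearly
# equal; their mean is the percentage that defines "1 dif."
unit_percentage = mean(step_percentages)

# Assumption made in the text: zero difficulty lies 1 dif. below D.
difficulty_of_d = 1

# Difficulty of each problem in dif. units, counted from zero.
scale = {p: difficulty_of_d + i for i, p in enumerate(chain)}

if __name__ == "__main__":
    print(f"1 dif. corresponds to {unit_percentage} per cent. of judges")
    for problem, value in scale.items():
        print(f"difficulty of {problem} = {value} dif.")
    print("T relative to K:", scale["T"] / scale["K"])
    print("E relative to T:", scale["E"] / scale["T"])
```

The ratios only have meaning once a zero point is fixed; moving the assumed zero changes every ratio while leaving the differences between problems untouched.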

By a fairly permissible hypothesis about the nature of the 
individual differences of these 200 teachers' powers of judging 
these differences in difficulty, the constant which I have here 
called " 1 dif." equals 1.371 times the amount of added difficulty 
which would cause 75 per cent, of these teachers to rate the 
harder of two problems as harder.* Since in other work with 
educational scales, this "P.E." or " 75%-difference" has been 
taken as a unit, we may well restate the scale above as

* An account of the nature of this hypothesis and the consequent deri- 
vation of the 1.371 P.E. is beyond the scope of this paper. Its essential 
feature is the assumption that the factors which make the judgments of 
two hundred teachers of mathematics concerning a problem's difficulty 
vary one from another can be arranged in n groups, these groups being 
approximately equal in magnitude of influence, and approximately uncor- 
related in action, and that n is fairly large (say, 15 or more).

the difficulty of D = 1.37 P.E.
the difficulty of K = 2.74 P.E.
the difficulty of A = 4.11 P.E.
the difficulty of T = 5.48 P.E.
the difficulty of H = 6.85 P.E.
the difficulty of E = 8.22 P.E.
the difficulty of I = 9.59 P.E.
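
The factor 1.371 can be reproduced under one common assumption, which the paper itself does not spell out: that the errors of judgment follow a normal distribution. Under that assumption (an assumption of this sketch, not a statement of the author's derivation), the difference that a given percentage of judges rate correctly is proportional to the normal quantile of that percentage, and 1 P.E. is the difference rated correctly by 75 per cent. The function name below is hypothetical; 82.25 is the mean of the six percentages from which "1 dif." was defined.

```python
from statistics import NormalDist


def difference_in_pe(percent_correct: float) -> float:
    """Difference, in P.E. units, that `percent_correct` per cent. of judges
    would rate correctly, ASSUMING normally distributed errors of judgment."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z = std_normal.inv_cdf(percent_correct / 100.0)
    z_for_one_pe = std_normal.inv_cdf(0.75)  # 75 per cent. defines 1 P.E.
    return z / z_for_one_pe


# "1 dif." is the difference rated correctly by 82.25 per cent. of judges.
one_dif_in_pe = difference_in_pe(82.25)

if __name__ == "__main__":
    print(f"1 dif. = {one_dif_in_pe:.3f} P.E.")
    for problem, dif in zip("DKATHEI", range(1, 8)):
        print(f"difficulty of {problem} = {dif * one_dif_in_pe:.2f} P.E.")
```

The printed list appears to multiply the rounded factor 1.37, so values computed from the unrounded factor can differ from it by one unit in the last decimal place.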

73½ per cent. of the two hundred judges judged that v > i; and
78½ per cent. of them, that w > v. For reasons which I shall
here omit, we may infer that, approximately,

v - i = .71 dif. or .98 P.E.
w - v = .85 dif. or 1.17 P.E.

and

v = 7.7 dif. or 10.57 P.E.; w = 8.55 dif. or 11.74 P.E.

So our entire series of standards, by the consensus of the two 
hundred judges, may be taken roughly as 

d, the difficulty of D, = 1.4 P.E. or 1 dif.
k, the difficulty of K, = 2.7 P.E. or 2 dif.
a, the difficulty of A, = 4.1 P.E. or 3 dif.
t, the difficulty of T, = 5.5 P.E. or 4 dif.
h, the difficulty of H, = 6.9 P.E. or 5 dif.
e, the difficulty of E, = 8.2 P.E. or 6 dif.
i, the difficulty of I, = 9.6 P.E. or 7 dif.
v, the difficulty of V, = 10.6 P.E. or 7.7 dif.
w, the difficulty of W, = 11.7 P.E. or 8.5 dif.

These results are sound as an expression of the approximate 
amounts of "difficulty" of the problems in question in the opinion
of the two hundred teachers in question. And such a consensus
as treated gives estimates of these amounts of "difficulty" far
superior to the opinion of a single teacher. Until a better scale 
is devised this scale may be freely used as a foot-rule to define 
and measure the difficulty of examples in algebra for pupils who 
have studied the subject for the stated length of time. 

By objective tests of pupils a better scale can be devised. Its 
validity will depend upon the validity of the assumptions made 
concerning the distribution of ability to solve problems in algebra 
in the pupils in question. The general nature of the argument 
in such a case has been fully illustrated by Dr. Buckingham in 
the case of a scale for the spelling-difficulty of words.

It will have been understood, I trust, that the extent to which
" difficulty " as judged by these two hundred teachers is identical 
with "difficulty" as judged by a consensus of, say, English or 
French, or Canadian, or Californian, teachers is a matter to be 
determined by experiment; and, similarly, that the extent to 
which any ten thousand children tested represent any other 
group of children is a matter to be determined by experiment. 
"Difficulty" is, of course, not one thing, but many, varying for 
the same problem with the amount of training and its nature. 
The scale, as given, has just the extent of application that this 
particular consensus warrants. Probably no one realizes the 
limitations of these measures of the "difficulty " of D, K, A, etc., 
so well as the author, who can easily add three to every one of 
the criticisms of such scales that he has received. Yet I un- 
hesitatingly assert that, at the present time, no measures of the 
" difficulty " of D, K, A, etc., for pupils after 20 weeks' work 
in algebra exist that have one half the probability of being true 
that the measures stated here have. 

Individual Differences amongst Teachers in Respect to 
Agreement with the Consensus in Rating 24 Prob- 
lems, G Being Omitted from Consideration. 

The order assigned to the problems by any one teacher must, 
Table I being the facts, as a rule differ from the order assigned 
by the consensus. I have computed the facts for a hundred of 
the teachers. In computing them I have taken the differences
(regardless of signs) between the teacher's ratings and the cor-
responding ratings by the consensus, and found the sum for each
teacher. In this sum G does not figure, since the variation was
so largely due to varying opinions concerning whether the pupils
would have been taught the meaning of the √ sign. Differ-
ences for T and P from either 7 or 8 (whichever was nearest)
were used; differences for I, X and Y from 18 or 19 or 20
(whichever was nearest) were used.
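
The computation just described can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the author's own procedure in detail: the function name and the data layout are hypothetical, and the five-problem example is invented for illustration and is not data from Table I.

```python
def sum_of_differences(teacher_ranks, consensus_ranks, omit=("G",)):
    """Sum of absolute differences between one teacher's ranks and the
    consensus ranks.

    `consensus_ranks` maps each problem to a tuple of acceptable ranks:
    a single rank normally, several where the consensus holds problems
    to be closely alike. The nearest acceptable rank is used, and the
    problems named in `omit` are left out of the sum.
    """
    total = 0
    for problem, acceptable in consensus_ranks.items():
        if problem in omit:
            continue
        rank = teacher_ranks[problem]
        total += min(abs(rank - r) for r in acceptable)
    return total


# Hypothetical miniature example (NOT data from the paper): five problems,
# with P2 and P3 closely alike in the consensus at ranks 2 and 3.
consensus = {"P1": (1,), "P2": (2, 3), "P3": (2, 3), "P4": (4,), "P5": (5,)}
teacher = {"P1": 2, "P2": 1, "P3": 3, "P4": 5, "P5": 4}

example_sum = sum_of_differences(teacher, consensus)

if __name__ == "__main__":
    print("sum of differences from the consensus:", example_sum)
```

For the experiment itself the consensus entry for T and for P would be (7, 8), and that for I, X and Y would be (18, 19, 20), with G omitted.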

The resulting sum of differences from the consensus ranges 
from 16 to 97, the latter result being rather far on toward the 
result that would be got from a random shuffling of the prob- 
lems ! The details are shown in Table III. 
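
For a sense of where a random shuffling would fall, the expected sum of differences can be computed exactly. The sketch below is an illustration in Python, not part of the original paper; it assumes 24 distinct consensus ranks and ignores the special allowance made for T, P, I, X and Y, which would lower the figure slightly. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def expected_sum_random(n: int) -> Fraction:
    """Exact mean of sum(|i - perm(i)|) over all orderings of n ranks."""
    # Each position i receives every rank j equally often, so the mean is
    # the sum of |i - j| over all n*n pairs, divided by n.
    total = sum(abs(i - j) for i in range(1, n + 1) for j in range(1, n + 1))
    return Fraction(total, n)


def brute_force_mean(n: int) -> Fraction:
    """Check by enumerating every ordering (only feasible for small n)."""
    perms = list(permutations(range(1, n + 1)))
    total = sum(
        sum(abs(i - p) for i, p in enumerate(perm, start=1)) for perm in perms
    )
    return Fraction(total, len(perms))


expected_24 = expected_sum_random(24)

if __name__ == "__main__":
    print("expected sum for 24 problems:", float(expected_24))
    print("check for 5 problems:", brute_force_mean(5), expected_sum_random(5))
```

On these assumptions the expected sum for 24 problems is 575/3, so a sum of 97 lies about halfway between perfect agreement with the consensus and a random order.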
TABLE III.

Individual Differences in Amount of Disagreement with the Con-
sensus in the Order of Difficulty of 24 Problems.

Quantity.                        Frequency.
Sum of Differences from the      Percentage of
Consensus of the 200 Judges.     Teachers.

16-23                             3
24-31                            10
32-39                            18
40-47                            29
48-55                            18
56-63                            14
64-71                             4
72-79                             2
80-87                             1
88-95                             0
96-103                            1



These sums of differences represent (inversely) complexes of
(1) ability to judge as the consensus does when it is right, i. e.,
real ability to judge the "difficulty" of problems, (2) acquaint-
ance with conditions as to the curriculum in algebra that are
nearer the average condition, (3) ability to judge as the con-
sensus does when it is wrong, (4) time and care given to the
ratings, and (5) accidental or "chance" variations due (5a)
to the individual's condition at the time and (5b) to the small
number of problems judged.

The last (5b) is not a very important cause of the differences
found amongst teachers. For the unreliability due to sampling,
though fairly large, is small in comparison with the individual
differences themselves. For example the two halves of the
series (A to K and L to Y) gave for the five "best" and five
"worst" individuals the following results:



          Five Best.                        Five Worst.
     Sums of Differences.              Sums of Differences.
  A to K.    L to Y.     Δ          A to K.    L to Y.     Δ

     8          8                     39         32
    11          8        3            31         42        11
    14                                35         43         8
    11         13                     44         39
    17          8                     39         58        19

And, in general, the mean square error of any one of the entries
of Table III. is, so far as the sampling of problems goes, on the 
average under 5. 

It is my opinion that, were all causes of variation save (1) 
removed, the individual differences would still be so great that 
the sum of differences from the correct order would still show 
a range of over three to one. 

Teachers College, 
Columbia University, 
New York. 



